Assignment 2 - Making Decisions with Data

Student Name
October 17, 2017

This assignment covers experimental study design, hypothesis development, common sources of bias and error, and simple predictive analytics using regression analysis and decision trees.

The objectives of this assignment are:

  • Understand experiment types and design
  • Understand common errors and bias factors
  • Apply simple predictive analytics
  • Use Python notebooks to create simple linear regression models

Question 1 - Experiment Design

Write a short response to each item below:

a) Briefly describe some common issues in experiment design:


In [ ]:

b) You are assigned to a new drug study using 40 mice (20 male, 20 female), where half will be treated and the other half left untreated. The procedure is complex and is limited to four (4) mice per day. What would be the most efficient approach to assigning the mice by group and day?


In [ ]:

c) Briefly describe some of the differences between experimental and observational studies:


In [ ]:

d) Create a simple experiment design for the following:

Study: Plants grow more when exposed to classical music

Review and complete the following questions:

  1. How would you design a study for this?
  2. How will you select the significant factors?
  3. How many samples/subjects will you need?
  4. Briefly describe your experiment approach:

In [ ]:

Question 2 - Experimental Error Types

a) Briefly describe Type 1 experimental errors and provide two (2) examples:

b) Briefly describe Type 2 experimental errors and provide two (2) examples:

Question 3 - Hypothesis Definition

Review the following study description:

A renowned doctor claims that 17-year-olds have an average body temperature higher than the average (98.6 °F). In a random sample of 25 17-year-olds, the average temperature is found to be 98.9 °F, with a standard deviation of 0.6 °F.
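
As an illustration only, the figures in the study description can be turned into a test statistic. The sketch below assumes a one-sample t-test, which the assignment does not name, and all variable names are hypothetical. It shows the arithmetic only and does not interpret the result.

```python
import math

# Values taken from the study description
sample_mean = 98.9      # observed mean temperature (degrees F)
reference_mean = 98.6   # average cited in the claim (degrees F)
sample_sd = 0.6         # sample standard deviation (degrees F)
n = 25                  # sample size

# Standard error of the mean: s / sqrt(n)
standard_error = sample_sd / math.sqrt(n)

# One-sample t statistic (assumed test; the assignment does not specify one)
t_statistic = (sample_mean - reference_mean) / standard_error
degrees_of_freedom = n - 1

print("Standard error: {:.3f}".format(standard_error))
print("t statistic: {:.3f}".format(t_statistic))
print("Degrees of freedom: {}".format(degrees_of_freedom))
```

Comparing the statistic against a critical value or p-value would be the next step, and that depends on how the hypotheses are stated.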

a) Identify the hypothesis in this study:


In [ ]:

b) Define the null hypothesis (H0) for this study:


In [ ]:

c) Identify any alternative hypothesis (Ha) applicable to this study:


In [ ]:

Question 4 - Causation

For each of the following statements, choose the explanation from the list below that best applies:

a. There are confounding variables
b. It is unclear which variable is the cause and which is the effect
c. It is unreasonable to generalize from the sample studied
d. The variables actually measured are not related to the effect
e. No plausible alternative explanation exists

1. A county inspector found that 35% of the sprinklers failed to activate under 7 psi of water pressure. However, the manufacturer maintains that the 7 psi threshold for passing or failing does not reflect typical water pressure in sprinkler systems.


In [ ]:

2. Samoans are reported to have an increasing incidence of violence, as measured by the steadily increasing number of television sales.


In [ ]:

3. A new study suggests that women are less able to be orchestra conductors, since only a small percentage of orchestra conductors are women.


In [ ]:

Question 5 - Regression Analysis

a. Briefly describe the difference between linear and logistic regression:

Linear Regression:

Logistic Regression:

b. Review and run the following code to create a linear regression model:


In [ ]:
# Import libraries -- install as needed
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

In [ ]:
# Load Boston house prices sample dataset
# Note: load_boston was removed in scikit-learn 1.2; this cell requires an earlier version
from sklearn.datasets import load_boston
boston = load_boston()

dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['PRICE'] = boston.target
dataset.head()

In [ ]:
# Create scatter plot showing relationship
plt.scatter(dataset.RM, dataset.PRICE)
plt.xlabel('Average number of rooms (RM)')
plt.ylabel("House Price")
plt.title("Regression analysis of RM and PRICE")

In [ ]:
# Create Linear Regression model
from sklearn.linear_model import LinearRegression

# Create training dataset
x_param = dataset.drop('PRICE', axis=1)

lm = LinearRegression()
lm.fit(x_param, dataset['PRICE'])

print('Regression intercept coefficient: {}'.format(lm.intercept_))
print('Regression number of coefficients: {}'.format(len(lm.coef_)))
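
To make the meaning of `intercept_` and `coef_` concrete, here is a minimal sketch on hypothetical, noise-free data (not the Boston dataset). The column names `x1` and `x2` and the generating formula are assumptions chosen so that the fitted values can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated as y = 5 + 3*x1 - 2*x2
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "x1": rng.uniform(0, 10, size=50),
    "x2": rng.uniform(0, 10, size=50),
})
demo_y = 5 + 3 * demo_X["x1"] - 2 * demo_X["x2"]

demo_lm = LinearRegression()
demo_lm.fit(demo_X, demo_y)

# Pair each coefficient with the column it belongs to
demo_coefs = pd.Series(demo_lm.coef_, index=demo_X.columns)

print("Intercept: {:.3f}".format(demo_lm.intercept_))
print(demo_coefs)
```

Because the data contain no noise, the fit should recover the intercept and the two slopes used to generate `demo_y`. The same pairing of `coef_` with column names can be applied to the model fitted above.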

In [ ]:
# Plot predicted house price comparison
plt.scatter(dataset.PRICE, lm.predict(x_param))
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title("House prices vs regression")

Question 6 - Regression Error Analysis

a. Review the following to estimate the mean squared error:


In [ ]:
# Calculate mean squared error
mse = np.mean((dataset.PRICE - lm.predict(x_param)) ** 2)
mse
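
The calculation in the cell above can be checked by hand on a small example. The sketch below uses hypothetical values (not the housing data) and compares the manual formula with scikit-learn's `mean_squared_error` helper.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, for illustration only
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

# Same formula as the cell above: mean of the squared differences
mse_manual = np.mean((actual - predicted) ** 2)

# scikit-learn helper that computes the same quantity
mse_sklearn = mean_squared_error(actual, predicted)

# Root mean squared error is expressed in the same units as the target
rmse = np.sqrt(mse_manual)

print("Manual MSE: {}".format(mse_manual))
print("sklearn MSE: {}".format(mse_sklearn))
print("RMSE: {:.4f}".format(rmse))
```

The errors here are 1, 0, -1 and -2, so the squared errors are 1, 0, 1 and 4, and their mean is 1.5.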

b. Review the following to estimate the residuals:


In [ ]:
# Calculate residuals
residuals = dataset['PRICE'] - lm.predict(x_param)

print("Head of residual: {}".format(residuals[:5]))
print("Mean of residual: {}".format(np.mean(residuals)))
print("SD of residual: {}".format(np.std(residuals)))
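
Two properties of residuals are worth checking when reading the output above. The sketch below uses hypothetical synthetic data (names and parameters are assumptions). It suggests that for a least-squares fit with an intercept, the training residuals average to roughly zero, so their standard deviation matches the square root of the mean squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data generated as y = 2x + 1 + noise
rng = np.random.default_rng(42)
x_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2 * x_demo[:, 0] + 1 + rng.normal(0, 1.5, size=100)

model = LinearRegression().fit(x_demo, y_demo)
demo_residuals = y_demo - model.predict(x_demo)

# With an intercept, training residuals average to (numerically) zero
print("Mean of residuals: {:.2e}".format(np.mean(demo_residuals)))

# The population SD of the residuals then equals sqrt(MSE)
demo_mse = np.mean(demo_residuals ** 2)
print("SD of residuals: {:.4f}".format(np.std(demo_residuals)))
print("sqrt(MSE):       {:.4f}".format(np.sqrt(demo_mse)))
```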

Question 7 - Regression Analysis

a. Using Question 5 as an example, write the code to complete a simple regression analysis on the following dataset.


In [ ]:
# Download California house prices sample dataset -- need to have internet connection.
from sklearn.datasets import fetch_california_housing
cal = fetch_california_housing()

df_cal = pd.DataFrame(cal.data, columns=cal.feature_names)
df_cal['PRICE'] = cal.target
df_cal.head()

In [ ]:
# Create scatter plot showing relationship of AveRooms (AR) and PRICE

In [ ]:
# Create training dataset and fit linear model

In [ ]:
# Create predicted vs actual house price comparison

b. Using Question 6 as an example, write the code for a simple regression error analysis.


In [ ]:
# Print Mean Squared Error

In [ ]:
# Print Residuals

Question 8 - (Optional) - Understanding Probability

In an act of mercy, the Emperor offers a prisoner a trial: to pick one pebble from one of two (2) bowls. There are fifty (50) white pebbles and fifty (50) black pebbles. The prisoner is blindfolded and must choose only one (1) pebble. If the prisoner chooses a white pebble, he will be freed, but if he chooses a black pebble, he will be immediately executed.

Describe how the pebbles should be distributed to ensure the highest chance of survival:
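
A small helper can be used to compare candidate distributions. This is a sketch under stated assumptions, not part of the original assignment. It assumes the prisoner picks one of the two bowls with equal probability and then draws a pebble uniformly from that bowl, and that an empty bowl counts as a failed draw. The function name `survival_probability` is hypothetical.

```python
from fractions import Fraction

def survival_probability(white_a, black_a, white_b, black_b):
    """Chance of drawing white when a bowl is picked at random, then a pebble."""
    total = Fraction(0)
    for white, black in ((white_a, black_a), (white_b, black_b)):
        if white + black == 0:
            continue  # assumption: an empty bowl yields no pebble, so no freedom
        # Each bowl is chosen with probability 1/2 (assumption)
        total += Fraction(1, 2) * Fraction(white, white + black)
    return total

# Even split: 25 white and 25 black in each bowl
even_split = survival_probability(25, 25, 25, 25)
print("Even split:", even_split)

# All white pebbles in one bowl, all black pebbles in the other
sorted_by_color = survival_probability(50, 0, 0, 50)
print("Sorted by color:", sorted_by_color)
```

Both of these baseline arrangements give a probability of 1/2. Other arrangements can be tried by changing the four arguments, as long as the whites total 50 and the blacks total 50.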


In [ ]:

All Done